Identifying Controversies in Wikipedia using Support Vector Machines
Abstract
Wikipedia is a very large and successful Web 2.0 example. As the number of Wikipedia articles and contributors grows at a very fast pace, disputes among contributors are also increasing. As a result of these disputes, many articles in Wikipedia are controversial. In this project, I propose a supervised learning model using Support Vector Machines (SVMs) to identify controversial articles in Wikipedia. The idea is to represent each article by a bag-of-words feature vector. Each value in this vector is the raw count of a word type appearing in the article. Experiments on real articles from Wikipedia show that the proposed approach can effectively identify controversial articles.
Introduction and Motivation
Using open source Web editing software (e.g. wiki), online community users can now edit, review and publish articles collaboratively. Among the largest and most successful wiki sites is Wikipedia (Voss July 2005), the online encyclopedia which comprises 16.6 million articles (both English and non-English), 9.5 million users and 200 languages (Wikipedia a). As Wikipedia grows very fast in both the number and the size of its articles, disputes among contributors also become more likely. Disputes often happen in articles with controversial content, on which contributors hold different or even opposite opinions. For example, “Iraq War” is one of the most controversial articles in Wikipedia. It attracts many disputes because contributors have different standpoints on the war: some support it while others strongly oppose it. They also have difficulty agreeing on various facts about the war. Additionally, disputes can be caused by “defensive” contributors who always argue to defend their ideas even when they are incorrect. As a result of disputes, many articles in Wikipedia are controversial.
In this project, I aim to identify controversial articles (controversies for short), which is important for two reasons. First, controversies appearing in Wikipedia articles are often a good reflection or documentation of the real world. Finding controversies in Wikipedia can therefore help the general public and scholars to better understand the corresponding real-world controversies. Second, it allows moderators and contributors to quickly identify highly controversial articles, thereby improving the effectiveness of the dispute resolution process by reducing the effort spent searching for such articles.
Copyright © 2009, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
However, determining controversies in Wikipedia is a great challenge. First, there is a huge number of articles, which makes it impossible to manually examine each article and identify controversies. Second, the articles cover a wide range of topics, and identifying controversies in them requires a lot of background knowledge. Third, articles are growing very fast, so any results obtained are soon outdated.
There have been several approaches related to determining controversies in Wikipedia. Wikipedia currently lets users assign controversial tags (Wikipedia b) to articles to signal that they are controversial. Other work (Lim et al. 2006; Hu et al. 2007b; 2007a; Vuong et al. 2008; Adler and de Alfaro 2007; Kittur et al. 2007) focuses on developing unsupervised models to rank the quality and controversy of Wikipedia articles. However, these approaches are either inefficient or achieve very low accuracy.
In this project, I propose an automatic approach for identifying controversies in Wikipedia. In particular, I represent each article as a bag-of-words vector in which each value is the raw count of a word type. I then apply a supervised learning model based on SVMs to learn models for detecting controversies. These learned models are subsequently used to classify unlabeled articles. Experiments on real articles from Wikipedia show that this solution holds great promise for detecting controversies.
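The pipeline described above (raw-count bag-of-words vectors fed to an SVM, whose learned model then labels unseen articles) can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration: the four toy documents, the simple alphabetic tokenizer, the hyperparameters, and the choice of a linear SVM trained by subgradient descent on the hinge loss. The text above does not say which SVM implementation, kernel, or preprocessing was actually used.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(documents):
    """Map every word type seen in the documents to a vector index."""
    vocab = {}
    for doc in documents:
        for token in tokenize(doc):
            vocab.setdefault(token, len(vocab))
    return vocab


def bag_of_words(text, vocab):
    """Return the raw-count vector of `text`; unknown words are ignored."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocab)
    for word, count in counts.items():
        if word in vocab:
            vector[vocab[word]] = count
    return vector


def train_linear_svm(vectors, labels, reg=0.01, lr=0.1, epochs=200):
    """Fit a linear SVM by subgradient descent on the regularized hinge loss.

    Labels must be +1 (controversial) or -1 (non-controversial).
    The hyperparameter defaults are illustrative, not tuned.
    """
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            margin = y * (sum(w * xi for w, xi in zip(weights, x)) + bias)
            # The L2 regularizer shrinks the weights on every step.
            weights = [w * (1 - lr * reg) for w in weights]
            if margin < 1:
                # Margin violated: push the hyperplane away from this example.
                weights = [w + lr * y * xi for w, xi in zip(weights, x)]
                bias += lr * y
    return weights, bias


def predict(vector, weights, bias):
    """Return +1 if the article is classified as controversial, else -1."""
    score = sum(w * xi for w, xi in zip(weights, vector)) + bias
    return 1 if score >= 0 else -1


# Hypothetical toy corpus standing in for labeled Wikipedia articles.
train_docs = [
    "editors dispute the war and revert each other",
    "the war article draws dispute and heated debate",
    "the river flows through the quiet valley",
    "the valley has a river and a small lake",
]
labels = [1, 1, -1, -1]

vocab = build_vocabulary(train_docs)
train_vectors = [bag_of_words(doc, vocab) for doc in train_docs]
weights, bias = train_linear_svm(train_vectors, labels)

# Classify an unlabeled article with the learned model.
new_article = "contributors dispute facts about the war"
print(predict(bag_of_words(new_article, vocab), weights, bias))
```

On a corpus of realistic size, a sparse vector representation and an established SVM library would be the more practical choice; the dense lists here only keep the sketch self-contained.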